PDF stage 2.2: glyph advances & metrics #539
Merged
Parse font glyph widths and advance the text matrix per glyph, on top of 2.1's placed-text emission, so segments, TJ kerning and lines land in the right place.

- Font metrics (pdf_document_parser, Font): /FirstChar + /Widths + /FontDescriptor /MissingWidth (simple), /W + /DW (descendant CIDFont, both `c [w...]` and `c_first c_last w` forms). Font::advance_width(code) returns the advance in text-space units with the MissingWidth/DW fallbacks; code_byte_width() is 1 (simple) / 2 (composite).
- Advance application (extract_text, GraphicsState::advance_text): emit one TextElement per shown segment (one Tj/'/", or one string of a TJ array); after each, advance Tm by sum(width*Tfs + Tc [+ Tw for single-byte 0x20]) * Th, and translate Tm by -n/1000*Tfs*Th for a TJ number. The element carries its total advance; per-glyph placement stays re-derivable from font->advance_width, keeping the run-vs-glyph choice in the renderer.

Out of scope (later): intra-segment glyph shaping (stage 3), AFM widths for non-embedded standard-14 fonts (stage 3), vertical writing advances (2.6).

Tests: composite /W+/DW and simple /Widths+/MissingWidth parsing asserted through advance_width; extract_text advance coverage (simple widths, TJ adjustment, char/word spacing, composite /DW, advance_width fallbacks).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
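For readers unfamiliar with the two `/W` forms named in the commit message, here is a minimal sketch of how such an array can be expanded into a CID-to-width map. It is not the parser from this PR: `WToken`, `expand_w_array` and the 65535 upper bound are hypothetical names and assumptions chosen for illustration, and the real range guard may behave differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <variant>
#include <vector>

// Hypothetical token model: a /W entry is either a number or an array of numbers.
using WToken = std::variant<double, std::vector<double>>;

// Assumed guard for this sketch: only 2-byte CIDs are accepted.
constexpr std::uint32_t kMaxCid = 65535;

// Expands a /W array into a CID -> width map (widths stay in glyph-space units).
// Parsing stops at the first malformed entry instead of guessing.
std::map<std::uint32_t, double> expand_w_array(const std::vector<WToken>& w) {
    std::map<std::uint32_t, double> widths;
    std::size_t i = 0;
    while (i + 1 < w.size()) {
        const double* first = std::get_if<double>(&w[i]);
        if (!first || *first < 0 || *first > kMaxCid) break;
        const auto c = static_cast<std::uint32_t>(*first);

        if (const auto* list = std::get_if<std::vector<double>>(&w[i + 1])) {
            // Form 1: c [w1 w2 ...] assigns consecutive CIDs starting at c.
            for (std::size_t k = 0; k < list->size() && c + k <= kMaxCid; ++k) {
                widths[c + static_cast<std::uint32_t>(k)] = (*list)[k];
            }
            i += 2;
        } else {
            // Form 2: c_first c_last w assigns one width to an inclusive range.
            if (i + 2 >= w.size()) break;
            const double* last = std::get_if<double>(&w[i + 1]);
            const double* width = std::get_if<double>(&w[i + 2]);
            if (!last || !width || *last < *first || *last > kMaxCid) break;
            const auto c_last = static_cast<std::uint32_t>(*last);
            for (std::uint32_t cid = c; cid <= c_last; ++cid) {
                widths[cid] = *width;
            }
            i += 3;
        }
    }
    return widths;
}
```

CIDs that end up absent from the map are the ones a `/DW` fallback would cover.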
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 256b988f1b
segment_advances now returns each code's advance alongside the total and takes a Font reference (called only when a font is present). TextElement carries the per-code advances vector so renderers need no re-derivation from font->advance_width.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
🤖 Generated with Claude Code
Stacked on #538 (stage 2.1); base is `pdf-text-transforms`, retarget to `main` once 2.1 merges.

Second slice of stage 2. Parses font glyph widths and advances the text matrix per glyph on top of 2.1's placed-text emission, so segments, `TJ` kerning and lines land in the right place. Per the agreed architecture, the run-vs-glyph choice stays in the renderer.

What's in here

- Font metrics (`pdf_document_parser`, `Font`): `/FirstChar` + `/Widths` + `/FontDescriptor` `/MissingWidth` (simple); `/W` + `/DW` from the descendant CIDFont (both `c [w…]` and `c_first c_last w` forms, with a range guard). `Font::advance_width(code)` returns the advance in text-space units with the `/MissingWidth` / `/DW` fallbacks; `code_byte_width()` is 1 (simple) / 2 (composite, the Identity-H/V case).
- Advance application (`extract_text`, `GraphicsState::advance_text`): a `TextElement` is now emitted per shown segment (one `Tj`/`'`/`"`, or one string of a `TJ` array). After each segment `Tm` advances by `Σ(width × Tfs + Tc [+ Tw for single-byte 0x20]) × Th`, and a `TJ` number translates `Tm` by `−n/1000 × Tfs × Th`. The element carries its total advance; a renderer wanting per-glyph placement re-derives per-code advances from `font->advance_width`.

Out of scope (later)

Intra-segment glyph shaping (the browser lays a segment out in a fallback font until the embedded font lands; stage 3), AFM widths for the non-embedded standard-14 fonts (stage 3), and vertical writing-mode advances (stage 2.6). Precise baseline placement (needs ascent metrics) also remains deferred.
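As a worked illustration of the advance rule under "What's in here", the two displacements can be sketched as below. This is not the code in `GraphicsState::advance_text`: `TextState`, `shown_segment_advance` and `tj_adjustment` are hypothetical names, widths are assumed to be already in text-space units (glyph width / 1000), and `Th` is assumed to be stored as a fraction (`Tz` / 100).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical text-state snapshot; field names mirror the PDF text-state parameters.
struct TextState {
    double Tfs = 1.0;  // font size
    double Tc = 0.0;   // character spacing
    double Tw = 0.0;   // word spacing
    double Th = 1.0;   // horizontal scaling as a fraction (Tz / 100)
};

// Horizontal displacement of Tm after one shown segment:
//   sum(width * Tfs + Tc [+ Tw for single-byte 0x20]) * Th
// `widths[i]` is the advance of `codes[i]` in text-space units.
double shown_segment_advance(const TextState& ts,
                             const std::vector<std::uint32_t>& codes,
                             const std::vector<double>& widths,
                             bool single_byte_font) {
    double sum = 0.0;
    for (std::size_t i = 0; i < codes.size() && i < widths.size(); ++i) {
        sum += widths[i] * ts.Tfs + ts.Tc;
        // Word spacing applies only to the single-byte code 0x20.
        if (single_byte_font && codes[i] == 0x20) {
            sum += ts.Tw;
        }
    }
    return sum * ts.Th;
}

// Horizontal displacement of Tm for a number n inside a TJ array:
//   -n / 1000 * Tfs * Th
double tj_adjustment(const TextState& ts, double n) {
    return -n / 1000.0 * ts.Tfs * ts.Th;
}
```

With `Tfs = 10`, `Tc = 1`, `Tw = 2`, `Th = 1`, the codes `0x41 0x20` with widths `0.5` and `0.25` give `(0.5 × 10 + 1) + (0.25 × 10 + 1 + 2) = 11.5`, and a `TJ` number of `500` moves `Tm` by `−5`.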
Tests

- `pdf_document_parser.cpp`: composite `/W` + `/DW` and a simple `/FirstChar` / `/Widths` / `/MissingWidth` font, asserted through `advance_width`.
- `pdf_page_text.cpp`: simple `/Widths` advancing a following show, `TJ` emitting per string with the numeric adjustment applied, char spacing, word spacing on the single-byte space, the composite 2-byte `/DW` advance, and the `advance_width` fallbacks.

HtmlOutputTests.
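The fallback behaviour those tests exercise can be sketched as follows. This is an assumption-laden illustration, not `Font::advance_width` itself: `SimpleFontMetrics`, `CidFontMetrics`, `simple_advance_width` and `cid_advance_width` are hypothetical names. The defaults of 0 for `/MissingWidth` and 1000 for `/DW` are the ones the PDF specification gives when the keys are absent.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical metrics for a simple font.
struct SimpleFontMetrics {
    int first_char = 0;            // /FirstChar
    std::vector<double> widths;    // /Widths, glyph-space units
    double missing_width = 0.0;    // /FontDescriptor /MissingWidth (spec default 0)
};

// Hypothetical metrics for a descendant CIDFont.
struct CidFontMetrics {
    std::map<std::uint32_t, double> w;  // expanded /W, glyph-space units
    double dw = 1000.0;                 // /DW (spec default 1000)
};

// Advance in text-space units; codes outside /Widths fall back to /MissingWidth.
double simple_advance_width(const SimpleFontMetrics& f, int code) {
    const long idx = static_cast<long>(code) - f.first_char;
    const bool in_range =
        idx >= 0 && static_cast<std::size_t>(idx) < f.widths.size();
    const double glyph_units =
        in_range ? f.widths[static_cast<std::size_t>(idx)] : f.missing_width;
    return glyph_units / 1000.0;
}

// Advance in text-space units; CIDs missing from /W fall back to /DW.
double cid_advance_width(const CidFontMetrics& f, std::uint32_t cid) {
    const auto it = f.w.find(cid);
    const double glyph_units = (it != f.w.end()) ? it->second : f.dw;
    return glyph_units / 1000.0;
}
```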